🐛 Add checks to catch race-driven panics in metrics endpoint e2e #1795

Merged

Conversation

bentito
Contributor

@bentito bentito commented Feb 19, 2025

Description

There’s potential for flakiness due to timing issues. For example, the TestOperatorControllerMetricsExportedEndpoint assumes that:
• The ClusterRoleBinding and token are immediately effective.
• The curl pod becomes ready promptly.
• The metrics endpoint is available as soon as the pod is ready.

In slower or variable environments, these assumptions might not hold, leading to intermittent failures. Adding explicit waits or retry logic for each resource’s readiness could help mitigate race conditions.

To fix this, this PR:

• Adds a generic retry helper that attempts a function up to 5 times with a 5-second delay between tries (a rough sketch follows this list).
• Wraps key steps in the metrics endpoint tests with the retry logic:
  • Creating the ClusterRoleBinding.
  • Generating the ServiceAccount token.
  • Creating the curl pod.
  • Validating the metrics endpoint via curl.
• Refactors these steps to return errors instead of calling require.NoError immediately, so transient issues are retried before the test ultimately fails.
• Together, these changes minimize test flakiness due to race conditions or temporary resource unavailability.
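
For illustration, a minimal sketch of what such a generic retry helper might look like (the exact helper in this PR may differ; the package and helper names here are assumptions):

package e2e

import (
	"time"
)

// retry attempts fn up to the given number of times, sleeping delay between
// tries, and returns the last error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return lastErr
}

A step such as creating the curl pod can then be wrapped as retry(5, 5*time.Second, func() error { return createCurlPod(ctx) }), where createCurlPod is a hypothetical helper that returns an error rather than failing the test immediately.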

Currently on main, on my local machine, I ran this script:

#!/bin/bash

pass_count=0
total=10

for i in $(seq 1 $total); do
    echo "Running test iteration $i/$total..."
    if make test-e2e > /dev/null 2>&1; then
        echo "Iteration $i: PASS"
        ((pass_count++))
    else
        echo "Iteration $i: FAIL"
    fi
done

echo "Summary: Passed $pass_count/$total times."

Failed on the 2nd try on main:

$ ./scripts/test_e2e_10_times.sh
Running test iteration 1/10...
Iteration 1: PASS
Running test iteration 2/10...
Iteration 2: FAIL
Running test iteration 3/10...
^C

I'm running the script now on this branch and will update with my findings.
I need to make the script report more detail, to confirm it's this e2e test that's actually causing the failure, but it looks like no joy :-(

Running the script on this branch:

Running test iteration 1/10...
Iteration 1: PASS
Running test iteration 2/10...
Iteration 2: PASS
Running test iteration 3/10...
Iteration 3: FAIL
Running test iteration 4/10...
Iteration 4: PASS
Running test iteration 5/10...
Iteration 5: FAIL
Running test iteration 6/10...

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@bentito bentito requested a review from a team as a code owner February 19, 2025 21:25

netlify bot commented Feb 19, 2025

Deploy Preview for olmv1 ready!

🔨 Latest commit: 493315f
🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/67be3b90876b3a0008825ec5
😎 Deploy Preview: https://deploy-preview-1795--olmv1.netlify.app

@bentito
Contributor Author

bentito commented Feb 19, 2025

Tries to fix #1781


codecov bot commented Feb 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.43%. Comparing base (7040ee2) to head (493315f).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1795      +/-   ##
==========================================
+ Coverage   68.34%   68.43%   +0.09%     
==========================================
  Files          63       62       -1     
  Lines        5117     5117              
==========================================
+ Hits         3497     3502       +5     
+ Misses       1390     1387       -3     
+ Partials      230      228       -2     
Flag Coverage Δ
e2e 51.78% <ø> (+0.15%) ⬆️
unit 55.89% <ø> (-0.02%) ⬇️


@joelanford
Member

The ClusterRoleBinding and token are immediately effective.
The curl pod becomes ready promptly.
The metrics endpoint is available as soon as the pod is ready.

Are there examples in the flakes for each of these assumptions not holding? If so, I think we should look at each one to understand what's happening and then decide what the correct fix is.

@bentito bentito changed the title from "🐛 Add retries and checks to catch race-driven panics" to "🐛 Add checks to catch race-driven panics in metrics endpoint e2e" on Feb 20, 2025
Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @bentito,

Terrific work! 🚀 This is looking great.

Just a couple of minor suggestions:

  • The PR description and commit history seem to be a bit out of sync with the latest changes. It would be great to update the description accordingly and squash the commits to maintain a clean history.
  • The 30s timeout might be a bit short, especially considering that the --force option is used here: #1795 (comment). Since the standard timeout across tests is 1 minute, it might be better to align with that for consistency.

Other than that, everything looks good to me! 👍

Thanks for your work on this. 🎉

Comment on lines +172 to +184
// Wait for the ClusterRoleBinding to be deleted.
if err := waitForDeletion(ctx, c.client, "clusterrolebinding", c.clusterBinding); err != nil {
	c.t.Logf("Error waiting for clusterrolebinding deletion: %v", err)
} else {
	c.t.Log("ClusterRoleBinding deleted")
}

// Wait for the Pod to be deleted.
if err := waitForDeletion(ctx, c.client, "pod", c.curlPodName, "-n", c.namespace); err != nil {
	c.t.Logf("Error waiting for pod deletion: %v", err)
} else {
	c.t.Log("Pod deleted")
}
Member

Do we really need to even wait for these to be deleted?

Contributor

+1, good point.
But I am betting that this is the fix: in the scenarios where it fails because the Pod gets stuck, we are now just logging instead of failing.

Contributor Author

@bentito bentito Feb 25, 2025

To be clear: I have a pretty quick test (takes about ~50m) -- run e2e 10 times. This test always fails on my local box when run on main, at least one fail out of 10, usually 3-5 fails. With the changes on this PR, that same test passes. I think @camilamacedo86 is right, the change in behavior is key to the fix: we log instead of fail. I guess I could try not waiting, but I'd rather move on since something is fixing the flake.
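
For reference, the waitForDeletion helper quoted above is not shown in this excerpt. Below is a rough sketch of what such a helper could do, assuming it delegates to kubectl wait --for=delete; the real helper's signature and client usage in the PR may differ:

package e2e

import (
	"context"
	"fmt"
	"os/exec"
)

// waitForDeletionSketch blocks until the named resource is gone or the
// timeout expires, by delegating to `kubectl wait --for=delete`.
// extraArgs can carry additional flags such as "-n", "<namespace>".
func waitForDeletionSketch(ctx context.Context, kind, name string, extraArgs ...string) error {
	args := append([]string{"wait", "--for=delete", fmt.Sprintf("%s/%s", kind, name), "--timeout=60s"}, extraArgs...)
	out, err := exec.CommandContext(ctx, "kubectl", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("waiting for %s/%s deletion: %w (output: %s)", kind, name, err, string(out))
	}
	return nil
}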

@tmshort
Contributor

tmshort commented Feb 25, 2025

There’s potential for flakiness due to timing issues. For example, the test assumes that:
• The ClusterRoleBinding and token are immediately effective.
• The curl pod becomes ready promptly.
• The metrics endpoint is available as soon as the pod is ready.

Seems like these all can be covered by retry logic in the validate method.

Signed-off-by: Brett Tofel <[email protected]>
@bentito bentito force-pushed the add-retry-validate-metrics-e2e branch from 1e0ac4c to 493315f on February 25, 2025 21:52
Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 26, 2025
@LalatenduMohanty
Member

/hold for @joelanford @camilamacedo86 or @azych to review

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 26, 2025
Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @bentito,

This might not fully address the root cause (why was it not possible to delete?). The fix mainly works because we are logging instead of failing, as discussed in this comment.

That said, I don’t see much value in spending more effort on it at this point. You did a great job addressing the intermittent scenario, and we can always revisit this if needed in the future. 🥇 I don’t see why we shouldn't move forward with this approach. If further tests show that the Pod/ClusterRoleBinding isn’t being cleaned up properly for some reason other than a timeout, or if we hit intermittent issues again as we add more tests, then we might need to look further. For now, it looks great to me! 👍

/lgtm
/approve

@azych
Contributor

azych commented Feb 26, 2025

I share the sentiment - I'm still unsure about the root cause, but if this works and other people are good with it, I'm OK too.
We can always revisit.

/lgtm

@azych azych removed their assignment Feb 26, 2025
@bentito bentito added this pull request to the merge queue Feb 26, 2025
Merged via the queue into operator-framework:main with commit 1573846 Feb 26, 2025
19 checks passed